observability: add the kube-state-metrics addon (+ operator CR-state metrics)#44

Merged: stxkxs merged 1 commit into main from observability-kube-state-metrics on Jun 15, 2026
@stxkxs stxkxs commented Jun 15, 2026


grafana-agent statically scrapes kube-state-metrics.kube-system.svc:8080, but no addon ever shipped kube-state-metrics, so that scrape hit nothing and everything built on kube_* / kube_customresource_* silently had no data: the cilium/cert-manager/eso dashboards on the standard metrics, and the seven agent persona dashboards + operator-slo CR-status alerts on the custom-resource ones.
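
For context, a static scrape of that target could look roughly like the fragment below. This is a sketch in Prometheus-style scrape config, not the actual grafana-agent config from this repo (which this PR does not touch or show); the job name is an assumption.

```yaml
# Illustrative only: the real grafana-agent config is not part of this PR.
scrape_configs:
  - job_name: kube-state-metrics        # assumed job name
    static_configs:
      - targets:
          # Resolves only if a Service named kube-state-metrics exists in kube-system.
          - kube-state-metrics.kube-system.svc:8080
```

Because the target is a fixed DNS name rather than a discovered one, a missing Service produces a failed scrape rather than a missing job, which fits the "silent no data" symptom described above.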

Adds the addon (prometheus-community/kube-state-metrics 7.5.1, app 2.19.1) in kube-system with fullnameOverride: kube-state-metrics so the existing scrape target resolves; grafana-agent is unchanged. customResourceState.config carries the operator's CR-state definitions (Platform / Tenant / BudgetPolicy / AgentFleet / EvalSuite); rbac.extraRules grants KSM list/watch on those CRDs so the kube_customresource_* series actually emit.
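
A minimal sketch of what such chart values might look like, shown for a single kind. The API group, version, resource plural, and phase values below are hypothetical placeholders, not taken from this PR; the real definitions cover all five kinds and mirror the operator chart's file.

```yaml
# Sketch of kube-state-metrics chart values. Names marked "hypothetical" are placeholders.
fullnameOverride: kube-state-metrics    # Service name must match the existing scrape target

customResourceState:
  enabled: true
  config:
    kind: CustomResourceStateMetrics
    spec:
      resources:
        - groupVersionKind:
            group: platform.example.com   # hypothetical API group
            version: v1alpha1             # hypothetical version
            kind: Tenant
          metrics:
            - name: status_phase          # emitted with the kube_customresource_ prefix by default
              help: Current phase of the Tenant
              each:
                type: StateSet
                stateSet:
                  labelName: phase
                  path: [status, phase]
                  list: [Pending, Ready, Failed]   # hypothetical phase values

rbac:
  extraRules:
    - apiGroups: ["platform.example.com"]   # hypothetical; must match the group above
      resources: ["tenants"]                # hypothetical plural
      verbs: ["list", "watch"]
```

Without the matching rbac.extraRules entry, KSM cannot list or watch the custom resources and the corresponding series never appear, so each API group in the CR-state config needs a rule.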

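Inlined, not mounted (chosen design): the operator's ConfigMap is in another namespace at a later sync wave, so mounting it would couple KSM's startup to the operator and require a restart once the ConfigMap appeared. The inlined config mirrors the operator chart's files/slo/customresourcestatemetrics.yaml and must be kept in step with it when the CRD status surface changes.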

Validated: helm template renders clean (service name, the flag, all 5 GVKs, RBAC for 3 API groups); task validate passes.

Part of #33 — the prod-alerting flip stays blocked on the pagerduty/slack Secrets.

…metrics)

grafana-agent statically scrapes kube-state-metrics.kube-system.svc:8080, but no
addon ever shipped kube-state-metrics — so that scrape hit nothing and every
kube_* / kube_customresource_* panel silently no-data'd: the cilium/cert-manager/
eso dashboards on the standard metrics, and the seven agent persona dashboards +
operator-slo CR-status alerts on the custom-resource ones.

Add the addon (prometheus-community/kube-state-metrics 7.5.1, app 2.19.1) in
kube-system with fullnameOverride: kube-state-metrics so the existing scrape
target resolves — grafana-agent unchanged. customResourceState.config carries the
eks-agent-platform operator's CR-state definitions (Platform / Tenant /
BudgetPolicy / AgentFleet / EvalSuite status_phase + status_field + condition),
and rbac.extraRules grants KSM list/watch on those CRDs so kube_customresource_*
series actually emit.

The CR-state config is inlined here rather than mounted from the operator chart's
ConfigMap on purpose: the operator runs at a later sync wave in a different
namespace, so mounting it would couple KSM's startup to the operator and need a
restart once the ConfigMap appeared. Observability scrape config belongs in the
observability repo; keep it in step with the operator chart's
files/slo/customresourcestatemetrics.yaml when its CRD status surface changes.

Part of #33 — the second half (flip slo.alerting in production) stays blocked on
the pagerduty-platform + slack-webhook-* Secrets the AlertmanagerConfig receivers
reference.
@github-actions


CI Results

Check: YAML Lint
Environment Kustomize Build: dev, staging, production

All validations passed.

@stxkxs stxkxs merged commit 5cbb027 into main Jun 15, 2026
5 checks passed
@stxkxs stxkxs deleted the observability-kube-state-metrics branch June 15, 2026 04:29